Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks, and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically, we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into text prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method improves across ImageNet shifted datasets and other model architectures such as LiT. Our proposed method is hyperparameter-free, requires no additional model training and can be easily scaled to other large multi-modal architectures.
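The consistency check described in this abstract can be illustrated with a minimal sketch. It assumes the top-1 labels for one image have already been computed under several prompt and image-transformation variants; the function names and the example labels are hypothetical, and the rule used here (flag the image whenever the variants disagree) is one simple reading of "consistency", not necessarily the authors' exact procedure.

```python
from collections import Counter


def agreement(predictions):
    """Return (majority_label, fraction of predictions equal to it)."""
    if not predictions:
        raise ValueError("need at least one prediction")
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)


def is_uncertain(predictions):
    """Flag an image when its top-1 label changes across variants."""
    _, fraction = agreement(predictions)
    return fraction < 1.0


# Hypothetical top-1 labels for one image under four prompt/transform variants.
stable = ["tabby cat", "tabby cat", "tabby cat", "tabby cat"]
unstable = ["tabby cat", "tiger cat", "tabby cat", "Egyptian cat"]

print(is_uncertain(stable), is_uncertain(unstable))
```

Because the flag is raised on any disagreement, this sketch introduces no tunable threshold, in keeping with the hyperparameter-free claim above.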
Translated by Google Translate
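Self-supervised learning (SSL) has been extensively explored in recent years. In particular, generative SSL has seen emerging success in natural language processing and other AI fields, as shown by the wide adoption of BERT and GPT. Despite this, contrastive learning, which relies heavily on structural data augmentation and complicated training strategies, has been the dominant approach in graph SSL, while the progress of generative SSL on graphs, especially graph autoencoders (GAEs), has so far not reached the potential promised in other fields. In this paper, we identify and examine the issues that negatively impact the development of GAEs, including their reconstruction objective, training robustness, and error metric. We present GraphMAE, a masked graph autoencoder that mitigates these issues for generative self-supervised graph pre-training. Instead of reconstructing graph structure, we propose to focus on feature reconstruction with both a masking strategy and a scaled cosine error, which benefit the robust training of GraphMAE. We conduct extensive experiments on 21 public datasets for three different graph learning tasks. The results show that GraphMAE, a simple graph autoencoder with careful design, can consistently outperform both contrastive and generative state-of-the-art baselines. This study provides an understanding of graph autoencoders and demonstrates the potential of generative self-supervised pre-training on graphs.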
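The scaled cosine error named in this abstract can be sketched as follows. The abstract does not give the formula; this sketch assumes the form (1 - cos(x, z))^gamma averaged over nodes, with gamma >= 1 as the scaling exponent. The function name, the default gamma, and the list-of-lists input format are choices made for illustration only.

```python
import math


def scaled_cosine_error(originals, reconstructions, gamma=2.0):
    """Mean of (1 - cosine similarity) ** gamma over paired feature vectors.

    Assumed form of the error; gamma >= 1 down-weights pairs that are
    already well reconstructed (cosine similarity close to 1).
    """
    if not originals or len(originals) != len(reconstructions):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for x, z in zip(originals, reconstructions):
        dot = sum(a * b for a, b in zip(x, z))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in z))
        if norm == 0.0:
            raise ValueError("zero vector has no cosine similarity")
        total += (1.0 - dot / norm) ** gamma
    return total / len(originals)


# Orthogonal features give an error of 1; opposite features give 2 ** gamma.
print(scaled_cosine_error([[1.0, 0.0]], [[0.0, 1.0]]))
print(scaled_cosine_error([[1.0, 0.0]], [[-1.0, 0.0]], gamma=2.0))
```

Unlike a mean squared error, this quantity depends only on the direction of each feature vector, not on its norm.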
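Graph Convolutional Networks (GCNs) have been widely applied in various fields because of their significant power in processing graph-structured data. Typical GCNs and their variants work under a homophily assumption (i.e., nodes of the same class tend to connect to each other), while ignoring the heterophily that exists in many real-world networks (i.e., nodes of different classes tend to form edges). Existing methods deal with heterophily mainly by aggregating higher-order neighborhoods or combining intermediate representations, which introduces noise and irrelevant information into the result. Moreover, these methods do not change the propagation mechanism, a fundamental part of GCNs, which still works under the homophily assumption. This makes it difficult to distinguish the representations of nodes from different classes. To address this problem, in this paper we design a novel propagation mechanism that automatically changes the propagation and aggregation process according to the homophily or heterophily between node pairs. To learn the propagation process adaptively, we introduce two measurements of the homophily degree between node pairs, learned from topological and attribute information, respectively. We then incorporate the learnable homophily degree into the graph convolution framework, which is trained in an end-to-end architecture, enabling it to go beyond the homophily assumption. More importantly, we theoretically prove that our model can constrain the similarity of representations between nodes according to their homophily degree. Experiments on seven real-world datasets demonstrate that this new approach outperforms state-of-the-art methods under heterophily or low homophily, and achieves competitive performance under homophily.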
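To make the homophily and heterophily terminology concrete, the sketch below computes the edge homophily ratio, a commonly used graph-level statistic: the fraction of edges whose endpoints share a class. This is not the learned pairwise homophily degree proposed in the abstract; the function name and the toy graph are hypothetical.

```python
def edge_homophily(edges, labels):
    """Fraction of edges that connect two nodes of the same class."""
    if not edges:
        raise ValueError("graph has no edges")
    same_class = sum(1 for u, v in edges if labels[u] == labels[v])
    return same_class / len(edges)


# Toy path graph 0-1-2-3 with classes [0, 0, 1, 1]:
# edges (0, 1) and (2, 3) are homophilous, edge (1, 2) is heterophilous.
edges = [(0, 1), (1, 2), (2, 3)]
labels = [0, 0, 1, 1]
ratio = edge_homophily(edges, labels)
print(round(ratio, 3))
```

A ratio near 1 indicates a homophilous graph; a ratio near 0 indicates strong heterophily.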
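Multi-label text classification refers to the problem of assigning each given document its most relevant labels from a label set. In real-world applications, the metadata of the given documents and the hierarchy of the labels are commonly available. However, most existing studies focus on modeling only the text information, with a few attempts to use either metadata or the hierarchy, but not both. In this paper, we bridge the gap by formalizing the problem of metadata-aware text classification in a large label hierarchy (e.g., with tens of thousands of labels). To address this problem, we present MATCH, an end-to-end framework that leverages both metadata and hierarchy information. To incorporate metadata, we pre-train the embeddings of text and metadata in the same space and also use fully connected attention to capture the interrelations between them. To leverage the label hierarchy, we propose different ways to regularize the parameters and output probability of each child label by its parents. Extensive experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH over state-of-the-art deep learning baselines.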
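Homomorphic Encryption (HE), which allows computation on encrypted data (ciphertext) without decrypting it first, enables secure but prohibitively slow Convolutional Neural Network (CNN) inference for privacy-preserving applications in the cloud. To reduce inference latency, one approach is to pack multiple messages into a single ciphertext, which reduces the number of ciphertexts and supports massive parallelism of Homomorphic Multiply-Accumulate (HMA) operations. Despite the faster HECNN inference, the mainstream packing schemes, Dense Packing (DensePack) and Convolution Packing (ConvPack), still introduce expensive rotation overhead, which prolongs the inference latency of HECNN for deeper and wider CNN architectures. In this paper, we propose a low-rank factorization method named FFConv, dedicated to efficient ciphertext packing, that reduces both the rotation overhead and the number of HMA operations. FFConv approximates a d x d convolution layer with low-rank factorized convolutions, in which a d x d low-rank convolution with fewer channels is followed by a 1 x 1 convolution that restores the channels. The d x d low-rank convolution with DensePack requires significantly fewer rotation operations, while the rotation overhead of the 1 x 1 convolution is close to zero. To the best of our knowledge, FFConv is the first work capable of reducing the rotation overhead incurred by DensePack and ConvPack simultaneously, without introducing additional special blocks into the HECNN inference pipeline. Compared with the prior art LoLa and Falcon, our method reduces inference latency by 88% and 21%, respectively, with comparable accuracy on MNIST and CIFAR-10.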
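The factorization described in this abstract (a d x d convolution with fewer output channels followed by a 1 x 1 convolution that restores the channels) can be illustrated by counting weights. The sketch below is plain arithmetic under stated assumptions: bias terms are ignored, the reduced channel count is called `rank` here for illustration, and the layer sizes are hypothetical. It says nothing about rotation counts under DensePack or ConvPack, which depend on the packing scheme.

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def factorized_params(kernel, c_in, c_out, rank):
    """Weights of a kernel x kernel conv to `rank` channels plus a 1 x 1 conv."""
    return conv_params(kernel, c_in, rank) + conv_params(1, rank, c_out)


# Hypothetical layer: 3 x 3 convolution, 64 -> 64 channels, reduced to 16.
full = conv_params(3, 64, 64)
low_rank = factorized_params(3, 64, 64, rank=16)
print(full, low_rank, round(full / low_rank, 2))
```

The saving grows as `rank` shrinks relative to the original output channel count, at the cost of approximation error in the factorized layer.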
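Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D, as images and videos, which can vary significantly in appearance across viewpoints and make recognition tasks challenging. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well studied in existing work. Here, we propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. The input ambiguities of 2D poses caused by projection and occlusion are difficult to represent through a deterministic mapping, so we adopt a probabilistic formulation for our embedding space. Experimental results show that our embedding model achieves higher accuracy than 3D pose estimation models when retrieving similar poses across different camera views. We also show that by training a simple temporal embedding model, we achieve superior performance on pose sequence retrieval and greatly reduce the embedding dimension compared with stacking frame-based embeddings, enabling efficient large-scale retrieval. Furthermore, to enable our embeddings to work with partially visible input, we investigate different keypoint occlusion augmentation strategies during training. We demonstrate that these occlusion augmentations significantly improve retrieval performance on partial 2D input poses. Results on action recognition and video alignment show that using our embeddings without any additional training achieves performance competitive with other models trained specifically for each task.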
Graph representation learning has emerged as a powerful technique for addressing real-world problems. Various downstream graph learning tasks have benefited from its recent developments, such as node classification, similarity search, and graph classification. However, prior work on graph representation learning focuses on domain-specific problems and trains a dedicated model for each graph dataset, which is usually non-transferable to out-of-domain data. Inspired by the recent advances in pre-training from natural language processing and computer vision, we design Graph Contrastive Coding (GCC), a self-supervised graph neural network pre-training framework, to capture the universal network topological properties across multiple networks. We design GCC's pre-training task as subgraph instance discrimination in and across networks and leverage contrastive learning to empower graph neural networks to learn the intrinsic and transferable structural representations. We conduct extensive experiments on three graph learning tasks and ten graph datasets. The results show that GCC pre-trained on a collection of diverse datasets can achieve performance competitive with or better than its task-specific and trained-from-scratch counterparts. This suggests that the pre-training and fine-tuning paradigm presents great potential for graph representation learning.
Recent years have witnessed the emerging success of graph neural networks (GNNs) for modeling structured data. However, most GNNs are designed for homogeneous graphs, in which all nodes and edges belong to the same types, making them infeasible to represent heterogeneous structures. In this paper, we present the Heterogeneous Graph Transformer (HGT) architecture for modeling Web-scale heterogeneous graphs. To model heterogeneity, we design node- and edge-type dependent parameters to characterize the heterogeneous attention over each edge, empowering HGT to maintain dedicated representations for different types of nodes and edges. To handle dynamic heterogeneous graphs, we introduce the relative temporal encoding technique into HGT, which is able to capture the dynamic structural dependency with arbitrary durations. To handle Web-scale graph data, we design the heterogeneous mini-batch graph sampling algorithm, HGSampling, for efficient and scalable training. Extensive experiments on the Open Academic Graph of 179 million nodes and 2 billion edges show that the proposed HGT model consistently outperforms all the state-of-the-art GNN baselines by 9%-21% on various downstream tasks. The dataset and source code of HGT are publicly available at https://github.com/acbull/pyHGT.
Since the invention of word2vec [28,29], the skip-gram model has significantly advanced the research of network embedding, such as the recent emergence of the DeepWalk, LINE, PTE, and node2vec approaches. In this work, we show that all of the aforementioned models with negative sampling can be unified into the matrix factorization framework with closed forms. Our analysis and proofs reveal that: (1) DeepWalk [31] empirically produces a low-rank transformation of a network's normalized Laplacian matrix; (2) LINE [37], in theory, is a special case of DeepWalk when the size of vertices' context is set to one; (3) as an extension of LINE, PTE [36] can be viewed as the joint factorization of multiple networks' Laplacians; (4) node2vec [16] is factorizing a matrix related to the stationary distribution and transition probability tensor of a 2nd-order random walk. We further provide the theoretical connections between skip-gram based network embedding algorithms and the theory of graph Laplacian. Finally, we present the NetMF method as well as its approximation algorithm for computing network embedding. Our method offers significant improvements over DeepWalk and LINE for conventional network mining tasks. This work lays the theoretical foundation for skip-gram based network embedding methods, leading to a better understanding of latent network representation learning.
Autonomous vehicles must often contend with conflicting planning requirements, e.g., safety and comfort could be at odds with each other if avoiding a collision calls for slamming the brakes. To resolve such conflicts, assigning importance ranking to rules (i.e., imposing a rule hierarchy) has been proposed, which, in turn, induces rankings on trajectories based on the importance of the rules they satisfy. On the one hand, imposing rule hierarchies can enhance interpretability but introduces combinatorial complexity to planning; on the other hand, differentiable reward structures can be leveraged by modern gradient-based optimization tools but are less interpretable and unintuitive to tune. In this paper, we present an approach to equivalently express rule hierarchies as differentiable reward structures amenable to modern gradient-based optimizers, thereby achieving the best of both worlds. We achieve this by formulating rank-preserving reward functions that are monotonic in the rank of the trajectories induced by the rule hierarchy; i.e., higher ranked trajectories receive higher reward. Equipped with a rule hierarchy and its corresponding rank-preserving reward function, we develop a two-stage planner that can efficiently resolve conflicting planning requirements. We demonstrate that our approach can generate motion plans at ~7-10 Hz for various challenging road navigation and intersection negotiation scenarios.
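The rank-preserving property can be illustrated with a deliberately simplified sketch. It assumes that rule satisfaction is Boolean and that the hierarchy ranks trajectories lexicographically, so satisfying a more important rule outweighs satisfying any combination of less important ones. Both are assumptions made for illustration, and the function name is hypothetical. Unlike the reward structures described in the abstract, this reward is not differentiable; it only shows what "monotonic in the rank" means.

```python
from itertools import product


def rank_preserving_reward(satisfied):
    """Reward for a tuple of booleans ordered from most to least important rule.

    Rule i (0-based) contributes 2 ** (n - 1 - i), so one higher-priority rule
    outweighs all lower-priority rules combined.
    """
    n = len(satisfied)
    return sum(2 ** (n - 1 - i) for i, ok in enumerate(satisfied) if ok)


# Enumerate every satisfaction pattern of three rules in lexicographic order
# (worst first) and compute the reward of each.
patterns = sorted(product([False, True], repeat=3))
rewards = [rank_preserving_reward(p) for p in patterns]
print(rewards)
```

With these weights, a trajectory that satisfies only the most important rule scores higher than one that satisfies every other rule but violates it.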